10 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
These materials are generated by Gerko Vink, who holds the copyright. The intellectual property belongs to Utrecht University. Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
Let’s start with the core:
Statistical inference
Statistical inference is the process of drawing conclusions from truths
Truths are boring, but they are convenient.
\(^1\) See Jelke Bethlehem’s CBS discussion paper for an overview of the history of survey sampling
Without any data we can still come up with a statistically valid answer.
Some sources of information can already tremendously guide the precision of our answer.
In Short
Information bridges the answer to the truth. Too little information may lead you to a false truth.
Good questions to ask yourself
Hmmm…
Would that mean that if we simply observe every potential unit, we would be unbiased about the truth?
The problem is a bit larger
We have three entities at play, here:
The more features we use, the more we capture about the outcome for the cases in the data
The more cases we have, the more we approach the true information
All these things are related to uncertainty. Our model can still yield biased results when fitted to \(\infty\) features. Our inference can still be wrong when obtained on \(\infty\) cases.
Core assumption: all observations are bonafide
When we do not have all information …
In some cases we estimate that we are only a bit wrong. In other cases we estimate that we could be very wrong. This is the purpose of testing.
The uncertainty measures about our estimates can be used to create intervals
An intuitive approach to evaluating an answer is confidence. In statistics, we often use confidence intervals. Discussing confidence can be hugely informative!
If we were to draw 100 samples from a population and compute a 95% CI on each, then about 95 of those 100 intervals would cover the true population value.
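This coverage interpretation can be checked by simulation. A minimal sketch in plain Python (our own illustration, not from the slides; the population values and the normal-approximation critical value 1.96 are assumptions):

```python
# Empirical coverage of a nominal 95% CI for a population mean,
# estimated by repeated sampling from a known population.
import numpy as np

rng = np.random.default_rng(42)
mu, sigma, n, reps = 10.0, 2.0, 50, 1000  # hypothetical population and design

covered = 0
for _ in range(reps):
    sample = rng.normal(mu, sigma, n)
    se = sample.std(ddof=1) / np.sqrt(n)
    lo, hi = sample.mean() - 1.96 * se, sample.mean() + 1.96 * se
    covered += (lo <= mu <= hi)          # does this interval cover the truth?

coverage = covered / reps
print(round(coverage, 3))  # close to 0.95
```

Across many repetitions the fraction of intervals that cover the true mean settles near the nominal 95% level.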
Neyman, J. (1934). On the Two Different Aspects of the Representative Method: The Method of Stratified Sampling and the Method of Purposive Selection.
Journal of the Royal Statistical Society Series A: Statistics in Society, 97(4), 558-606.
We can replicate our sample.
Full sampling validation of a model’s inferences is a lot of work.
Under some general assumptions, we can use the same data to validate our model’s inferences and predictions.
Take the following definition:
Assumptions are a statistician's faith. It is often impossible to prove that they hold in practice, but we choose to believe that they do.
Sensitivity analyses
I often use computational evaluation techniques to quantify the impact of the assumptions made. For example, we can test the effect of violating assumptions on our results and verify whether the inferences are sensitive to such violations. We can even identify the point at which assumptions start to influence our inferences.
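One such computational evaluation, sketched in plain Python (our own illustration; the distributions, sample size, and the precomputed t quantile are assumptions): check how the coverage of a nominal 95% t-interval for a mean degrades when the normality assumption is violated by strong skew.

```python
# Sensitivity of the t-interval to non-normality: compare empirical coverage
# for normal data versus skewed (exponential) data at a small sample size.
import numpy as np

rng = np.random.default_rng(1)

def coverage(sampler, true_mean, n, tcrit, reps=2000):
    """Fraction of t-intervals that cover the true mean."""
    hit = 0
    for _ in range(reps):
        x = sampler(n)
        se = x.std(ddof=1) / np.sqrt(n)
        hit += abs(x.mean() - true_mean) <= tcrit * se
    return hit / reps

tcrit = 2.262  # 97.5% t quantile for df = 9 (precomputed)
n = 10
cov_normal = coverage(lambda n: rng.normal(0, 1, n), 0.0, n, tcrit)
cov_skewed = coverage(lambda n: rng.exponential(1.0, n), 1.0, n, tcrit)
print(cov_normal, cov_skewed)  # skewed data undercovers at small n
```

Repeating this over a grid of sample sizes shows the point at which the violated assumption stops mattering for the inference.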
Whenever I evaluate something, I tend to look at three things:
As a function of model complexity in specific modeling efforts, these components play a role in the bias/variance tradeoff
Individual intervals can also be hugely informative!
Individual intervals are generally wider than confidence intervals
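Why individual (prediction) intervals are wider can be seen in the simple normal case. A sketch in plain Python (our illustration; the data and the precomputed t quantile are assumptions): the CI half-width scales with \(s/\sqrt{n}\), while the prediction-interval half-width scales with \(s\sqrt{1 + 1/n}\).

```python
# Half-width of a 95% CI for the mean versus a 95% prediction interval
# for a single new observation, under a simple normal model.
import numpy as np

rng = np.random.default_rng(7)
x = rng.normal(100, 15, 30)      # hypothetical sample
n, s = len(x), x.std(ddof=1)
tcrit = 2.045                    # 97.5% t quantile for df = 29 (precomputed)

ci_half = tcrit * s / np.sqrt(n)          # uncertainty about the mean
pi_half = tcrit * s * np.sqrt(1 + 1 / n)  # uncertainty about a new case
print(round(ci_half, 2), round(pi_half, 2))
```

The prediction interval must absorb the full sampling variation of a single case, not just the (much smaller) uncertainty about the mean, so here it is several times wider.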
Be careful
Narrower intervals mean less uncertainty.
It does not mean that the answer is correct!
On 28 January 1986, 73 seconds into its flight and at an altitude of 9 miles, the space shuttle Challenger experienced an enormous fireball caused by one of its two booster rockets and broke up. The crew compartment continued its trajectory, reaching an altitude of 12 miles, before falling into the Atlantic. All seven crew members, consisting of five astronauts and two payload specialists, were killed.
In the decision to proceed with the launch, dark data were present. And no-one noticed!
This missing information has the potential to mislead people. The notion that we can be misled is essential because it also implies that artificial intelligence can be misled!
If you don’t have all the information, there is always the possibility of drawing an incorrect conclusion or making a wrong decision.
We now have a new problem:
What would be a simple solution to allowing for valid inferences on the incomplete sample? Would that solution work in practice?
There are two sources of uncertainty that we need to cover when analyzing incomplete data:
A straightforward and intuitive solution for analyzing incomplete data in such scenarios is multiple imputation (Rubin, 1987).
Rubin, D. B. (1987). Multiple imputation for nonresponse in surveys. John Wiley & Sons.
I’m really sorry!
In practice we don’t know if we did well, because we often lack the necessary comparative truths.
For example:
Let’s assume that we have an incomplete data set and that we can impute (fill in) the incomplete values under multiple candidate models
Challenge
Imputing this data set under one model may yield different results than imputing this data set under another model. Identify the best model!
Problem
We have no idea about the validity of either model's results: we would need either the true observed values or the estimand before we can judge the performance and validity of the imputation model.
Not all is lost
We do have a constant in our problem: the observed values
Cai, M., van Buuren, S., & Vink, G. (2022). Graphical and numerical diagnostic tools to assess multiple imputation models by posterior predictive checking.
Lagaay, A. M., J. C. Van der Meij, and W. Hijmans. 1992. “Validation of Medical History Taking as Part of a Population Based Survey in Subjects Aged 85 and over.” British Medical Journal 304 (6834): 1091–2.
Van Buuren, S. (2018). Flexible imputation of missing data. CRC press. Chapter 9.2
In a survey about research integrity and fraud we surveyed behaviours and practices in the following format.
Many behaviours were surveyed over multiple groups of people. Some findings:
- Not Applicable's over features to allow for a pattern-wise analysis (stratified analysis).
- Not Applicables to allow for listwise deletion.

We know:

- Not Applicable is not randomly distributed over the data. Removing them is therefore not valid!
- Not Applicables are bonafide missing values: there should be no observations.

There's no such thing as a free lunch
Every imputation will bias the results. For some we know the direction of the bias, for some we have no idea. We do not have access to the truth.
We chose to impute the data as 1 (never). There are a couple of reasons why we think that this is the most defensible scenario:

- Never has a semantic similarity to a behaviour not being applicable. However, Never implies intentionality; Not Applicable does not.
- Never will underestimate intentional behaviours.

In this case the choice was made to make a deliberate error. The estimates obtained would serve as an underestimation of true behaviour and can be considered a lower-bound estimate.
Everything is a missing data problem
All models are wrong, but some are useful
How wrong can a model be to still be useful?
Dark data are concealed from us, and that very fact means we are at risk of misunderstanding, of drawing incorrect conclusions, and of making poor decisions.
| No | Description | No | Description |
|---|---|---|---|
| 1 | Data We Know Are Missing | 9 | Summaries of Data |
| 2 | Data We Don’t Know are Missing | 10 | Measurement Error and Uncertainty |
| 3 | Choosing Just Some Cases | 11 | Feedback and Gaming |
| 4 | Self-Selection | 12 | Information Asymmetry |
| 5 | Missing What Matters | 13 | Intentionally Darkened Data |
| 6 | Data Which Might Have Been | 14 | Fabricated and Synthetic Data |
| 7 | Changes with Time | 15 | Extrapolating beyond Your Data |
| 8 | Definitions of Data | 16 | Data not yet observable |
Missing or dark data can occur for a lot of reasons. Or for no reason at all. For example
Missing data can severely complicate interpretation and analysis
We assume we know where the missing data are
Cases where the assumption does not hold:
\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) = P(R|\psi) \]
David Hand calls this mechanism Not Data Dependent
\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) = P(R|Y^\mathrm{obs}, X, \psi) \]
David Hand calls this mechanism Seen Data Dependent
\[ P(R|Y^\mathrm{obs}, Y^\mathrm{mis}, X, \psi) \]
does not simplify
David Hand calls this mechanism Unseen Data Dependent
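The three mechanisms can be made concrete by simulation. A sketch in plain Python (our own illustration; the variables and missingness probabilities are assumptions), where \(R\) is the response indicator for \(y\):

```python
# Generate missingness in y under each of the three mechanisms and compare
# the complete-case mean of y with the full-data mean.
import numpy as np

rng = np.random.default_rng(0)
n = 10000
x = rng.normal(size=n)
y = x + rng.normal(size=n)

# MCAR / "Not Data Dependent": P(R) depends on nothing
r_mcar = rng.random(n) < 0.5
# MAR / "Seen Data Dependent": P(R) depends only on the observed x
r_mar = rng.random(n) < 1 / (1 + np.exp(-x))
# MNAR / "Unseen Data Dependent": P(R) depends on the unseen y itself
r_mnar = rng.random(n) < 1 / (1 + np.exp(-y))

# Complete-case mean of y: unbiased under MCAR, biased under MAR and MNAR
print(y.mean(), y[r_mcar].mean(), y[r_mar].mean(), y[r_mnar].mean())
```

Only under the first mechanism does throwing away the incomplete cases leave the mean of \(y\) intact; under the other two, the observed cases are a selective subset.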
Schafer, J. L. & Graham, J. W. (2002). Missing Data. Psychological Methods, 7 (2), 147-177. doi: 10.1037/1082-989X.7.2.147
Prevent unintended missing data
Ad-hoc methods make strong assumptions
| Method | Unbiased: Mean | Unbiased: Reg Weight | Unbiased: Correlation | Standard Error |
|---|---|---|---|---|
| Listwise | MCAR | MCAR | MCAR | Too large |
| Pairwise | MCAR | MCAR | MCAR | Complicated |
| Mean | MCAR | – | – | Too small |
| Regression | MAR | MAR | – | Too small |
| Stochastic | MAR | MAR | MAR | Too small |
| LOCF | – | – | – | Too small |
| Indicator | – | – | – | Too small |
Weighting minimizes bias with unit nonresponse
For inferential purposes, proper imputation strategies quickly prove to be more efficient and more accurate than weighting strategies (Boeschoten et al., 2017).
Boeschoten, L., Vink, G., & Hox, J. J. C. M. (2017). How to Obtain Valid Inference under Unit Nonresponse? Journal of Official Statistics, 33(4), 963-978. https://doi.org/10.1515/jos-2017-0045
Maximum likelihood: The royal road to missing data
Multiple imputation is an all-round principled method
The goal:
We are not interested in whether the imputed value corresponds to its true counterpart in the population; rather, we sample plausible values that could have been observed from the posterior predictive distribution.
Let our analysis model be
with output
Call:
lm(formula = hgt ~ age + tv)
Residuals:
Min 1Q Median 3Q Max
-24.679 -5.134 -0.398 5.175 23.778
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 105.4823 3.4704 30.395 < 2e-16 ***
age 3.8430 0.3262 11.782 < 2e-16 ***
tv 0.4919 0.1278 3.849 0.000155 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 8.389 on 221 degrees of freedom
(524 observations deleted due to missingness)
Multiple R-squared: 0.7742, Adjusted R-squared: 0.7721
F-statistic: 378.8 on 2 and 221 DF, p-value: < 2.2e-16
generated on 224 cases. The full data set contains 748 cases.
To impute and analyze the same model with mice, we can simply run:
boys %>%
mice(m = 5, method = "cart", printFlag = FALSE) %>%
complete("all") %>%
map(~.x %$% lm(hgt ~ age + tv)) %>%
pool() %>%
summary()

     term estimate std.error statistic df p.value
1 (Intercept) 71.5467019 0.61617119 116.114975 735.4492 0.000000e+00
2 age 7.0475726 0.09475359 74.377898 75.9456 1.043045e-72
3 tv -0.5577935 0.09163996 -6.086793 39.9114 3.598196e-07
We have used mice to obtain draws from a posterior predictive distribution of the missing data, conditional on the observed data.
The imputed values are mimicking the sampling variation and can be used to infer about the underlying TDGM, if and only if:
Instead of drawing only imputations from the posterior predictive distribution, we might as well overimpute the observed data.
boys %>%
mice(m = 5, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>%
complete("all") %>%
map(~.x %$% lm(hgt ~ age + tv)) %>%
pool() %>%
summary()

     term estimate std.error statistic df p.value
1 (Intercept) 71.4727297 0.7637067 93.586620 71.79955 8.889816e-77
2 age 6.8882608 0.1210342 56.911699 21.81539 3.232474e-25
3 tv -0.4038602 0.1028084 -3.928281 33.89849 3.989687e-04
But we make an error!
Rubin (1987, p76) defined the following rules:
For any number of multiple imputations \(m\), the combination of the analysis results for any estimate \(\hat{Q}\) of estimand \(Q\) with corresponding variance \(U\), can be done in terms of the average of the \(m\) complete-data estimates
\[\bar{Q} = \sum_{l=1}^{m}\hat{Q}_l / m,\]
and the corresponding average of the \(m\) complete data variances
\[\bar{U} = \sum_{l=1}^{m}{U}_l / m.\]
Rubin, D.B. (1987). Multiple Imputation for Nonresponse in Surveys. New York: John Wiley and Sons.
Simply using \(\bar{Q}\) and \(\bar{U}\) to obtain our inferences would be too simplistic. In that case we would ignore any variation between the separate \(\hat{Q}_l\), as well as the fact that we only generate a finite number of imputations \(m\). Rubin (1987, p. 76) established that the total variance \(T\) of \((Q-\bar{Q})\) equals
\[T = \bar{U} + B + B/m,\]
Where the between imputation variance \(B\) is defined as
\[B = \sum_{l=1}^{m}(\hat{Q}_l - \bar{Q})^\prime(\hat{Q}_l - \bar{Q}) / (m-1)\]
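These pooling rules take only a few lines to apply. A numerical sketch in plain Python (our own illustration; the estimates and variances are hypothetical):

```python
# Rubin's rules for m completed-data estimates Q_l with within-imputation
# variances U_l: pooled estimate Qbar and total variance T = Ubar + B + B/m.
import numpy as np

Q = np.array([1.05, 0.98, 1.10, 1.02, 0.95])       # hypothetical estimates, m = 5
U = np.array([0.040, 0.038, 0.042, 0.039, 0.041])  # hypothetical within variances

m = len(Q)
Qbar = Q.mean()                        # average of the complete-data estimates
Ubar = U.mean()                        # average within-imputation variance
B = np.sum((Q - Qbar) ** 2) / (m - 1)  # between-imputation variance
T = Ubar + B + B / m                   # total variance (Rubin, 1987, p. 76)
print(Qbar, T)
```

The extra term \(B/m\) accounts for using only a finite number of imputations; as \(m\) grows it vanishes and \(T\) approaches \(\bar{U} + B\).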
This assumes that some of the data are observed and remain constant over the synthetic sets
The total variance \(T\) of \((Q-\bar{Q})\) should (Reiter, 2003) equal
\[T = \bar{U} + B/m.\]
Reiter, J.P. (2003). Inference for Partially Synthetic, Public Use Microdata Sets. Survey Methodology, 29, 181-189.
boys %>%
mice(m = 5, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>%
complete("all") %>%
map(~.x %$% lm(hgt ~ age + tv)) %>%
pool(rule = "reiter2003") %>%
summary()

     term estimate std.error statistic df p.value
1 (Intercept) 71.3427191 0.67872263 105.113218 4670.20788 0.000000e+00
2 age 6.9161820 0.09961066 69.432144 178.28359 5.318364e-131
3 tv -0.4015052 0.09691269 -4.142958 56.62364 1.155999e-04
Think back to the goal of statistical inference: we want to get back to the true data generating model.
The multiplicity of the solution allows for smoothing over any Monte Carlo error that may arise from generating a single set.
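The smoothing effect of multiplicity is just the variance reduction of an average. A sketch in plain Python (our own illustration; the draws stand in for estimates from generated sets):

```python
# Averaging over m generated sets smooths Monte Carlo error: the variance of
# a mean of m independent simulation draws is about 1/m of a single draw's.
import numpy as np

rng = np.random.default_rng(11)
reps, m = 5000, 5

single = rng.normal(size=reps)                    # estimate from one set
pooled = rng.normal(size=(reps, m)).mean(axis=1)  # average over m sets

print(single.var(), pooled.var())  # pooled variance is roughly 1/m as large
```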
mira <- boys %>%
mice(m = 6, method = "cart", printFlag = FALSE, where = matrix(TRUE, 748, 9)) %>%
list('1' = rbind(complete(., 1), complete(., 2)),
'2' = rbind(complete(., 3), complete(., 4)),
'3' = rbind(complete(., 5), complete(., 6))) %>% .[-1] %>%
data.table::setattr("class", c("mild", class(.))) %>%
map(~.x %$% lm(hgt ~ reg))
mira %>% pool(rule = "reiter2003") %>%
summary() %>% tibble::column_to_rownames("term") %>% round(3)

            estimate std.error statistic df p.value
(Intercept) 152.014 3.746 40.582 112.526 0
regeast -17.815 4.461 -3.993 771.076 0
regwest -23.092 4.397 -5.252 95.353 0
regsouth -27.579 4.451 -6.196 177.098 0
regcity -25.928 6.111 -4.243 19.899 0
mira %>% pool(rule = "reiter2003",
custom.t = ".data$ubar * 2 + .data$b / .data$m") %>%
summary() %>% tibble::column_to_rownames("term") %>% round(3)

            estimate std.error statistic df p.value
(Intercept) 152.014 5.118 29.703 112.526 0.000
regeast -17.815 6.228 -2.860 771.076 0.004
regwest -23.092 5.989 -3.856 95.353 0.000
regsouth -27.579 6.125 -4.503 177.098 0.000
regcity -25.928 7.928 -3.270 19.899 0.004
Some adjustment to the pooling rules is needed to avoid p-inflation.
Raab, Gillian M, Beata Nowok, and Chris Dibben. 2018. “Practical Data Synthesis for Large Samples”. Journal of Privacy and Confidentiality 7 (3):67-97. https://doi.org/10.29012/jpc.v7i3.407.
With synthetic data generation and synthetic data implementation come some risks.
Any idea?
Nowadays many synthetic data cowboys claim that they can generate synthetic data that looks like the real data that served as input.
This is like going to Madame Tussauds: at face value it looks identical, but when experienced in real life it's just not the same as the living thing.
Many of these synthetic data packages only focus on marginal or conditional distributions. With mice we also consider the inferential properties of the synthetic data.
In general, we argue [^4] that any synthetic data generation procedure should
Volker, T.B.; Vink, G. Anonymiced Shareable Data: Using mice to Create and Analyze Multiply Imputed Synthetic Datasets. Psych 2021, 3, 703-716. https://doi.org/10.3390/psych3040045
When valid synthetic data are generated, the variance of the estimates is correct, such that the confidence intervals cover the population (i.e. true) value sufficiently [^5]. Take e.g. the following proportional odds model from Volker & Vink (2021):
| term | estimate | synthetic bias | synthetic cov |
|---|---|---|---|
| age | 0.461 | 0.002 | 0.939 |
| hc | -0.188 | -0.004 | 0.945 |
| regeast | -0.339 | 0.092 | 0.957 |
| regwest | 0.486 | -0.122 | 0.944 |
| regsouth | 0.646 | -0.152 | 0.943 |
| regcity | -0.069 | 0.001 | 0.972 |
| G1\(|\)G2 | -6.322 | -0.254 | 0.946 |
| G2\(|\)G3 | -4.501 | -0.246 | 0.945 |
| G3\(|\)G4 | -3.842 | -0.244 | 0.948 |
| G4\(|\)G5 | -2.639 | -0.253 | 0.947 |
Gerko Vink @ Anton de Kom Universiteit, Paramaribo